In [302]:
%matplotlib inline
%load_ext rpy2.ipython
import numpy as np
import pandas as pd
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# Set DEVELOPER_KEY to the API key value from the APIs & auth > Registered apps
# tab of
#   https://cloud.google.com/console
# Please ensure that you have enabled the YouTube Data API for your project.
DEVELOPER_KEY = "YOUR_API_KEY"  # placeholder: never publish a real key
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"
youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION, developerKey=DEVELOPER_KEY)
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
In [303]:
%%R
# Load R libraries silently (this cell is not shown in the slides)
library(dplyr)
library(ggplot2)
library(reshape2)
library(stringr)

Data Analysis Examples

  • YouTube Video Ranking
    • Bayesian Rankings vs ML Rankings

  • Caribbean Crime Forecasting
    • Hierarchical Modeling
    • Gaussian Processes

  • Paint By Numbers
    • Mixture Models
    • Infinite Mixture Models

  • Putting it all together

Ranking YouTube Videos

  1. Collect some data
  2. Determine the best ones by looking at which are "liked" with the highest probability

Done. Maybe?

In [304]:
def get_videos(query, max_results):
    """Search YouTube for `query` and return a {videoId: title} dict."""
    response = youtube.search().list(
        q=query,
        part="id,snippet",
        maxResults=max_results,
        type="video"
    ).execute()

    videos = {}
    for result in response.get("items", []):
        if result["id"]["kind"] == "youtube#video":
            videos[result['id']['videoId']] = result['snippet']['title']

    return videos
In [305]:
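def get_statistics(videos):
    """Return a DataFrame of statistics, indexed by video id, for `videos`."""
    response = youtube.videos().list(
        id=','.join(videos.keys()),
        part="id,statistics",
        maxResults=len(videos)
    ).execute()

    res = []
    for stat in response.get('items', []):
        vstat = {'id': stat['id'], 'title': videos[stat['id']]}
        vstat.update(stat['statistics'])
        res.append(vstat)

    df = pd.DataFrame(res).set_index('id')
    # The API returns counts as strings; convert every column except the title.
    count_cols = df.columns.drop('title')
    df[count_cols] = df[count_cols].apply(pd.to_numeric, errors='coerce')
    return df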

Get Some Data

In [306]:
videos = get_videos('office+pranks', 50)
stats = get_statistics(videos)[['title', 'likeCount', 'dislikeCount']]
stats.head(15)
Out[306]:
id           title                                               likeCount  dislikeCount
q4360lNoW7A  The Dental Office [FULL] - Pranks and Hidden C...          54            17
TEpB8JZ8SWg  Moosejaw's Office Pranks                                    6             0
hiW1pDzWG-s  The Office Jim and Dwight pranks 1080p                     50             5
9G9vgT8bEb0  Office Prank Revenge! - office pranks                    2633           223
gmxQSwwTRqU  [FULL] Japanese Dinosaur Prank                           2298            87
U1PHpkdvNOs  Gareth's Stapler - The Office - BBC                      1203            71
jihMdGOXO3E  How to Do the Autocorrect Prank | Office Pranks            71             5
q0XdthbOkMU  Best Office Prank Ever                                  13063          1898
cY1vzC62ZWo  How to Do the Mayo Donut Prank | Office Pranks             48             6
tbSaenItxNU  Mother of all Pranks | Fake cops raid company ...        1836           154
wDqYcvFKlFU  Day 4: Office Pranks                                        2             0
fYq2_fu30tI  Cousin Sal's Surprise Ice Bucket Challenge               2798            62
OcIXOFnYlyQ  Dinosaur Prank Made in Japan T Rex Fail !!               1186            97
jtrjY8Yoor0  Office Pranks and Shenanigans at ITS HQ                   300            18
4aViH-dpU2I  EN - Max Payne 3 - s4 - Office pranks!                      3             1

Rank by likeCount Ratio

In [309]:
stats['totalVotes'] = stats['likeCount'] + stats['dislikeCount']
stats['p']  = stats['likeCount'].astype(np.float64) / stats['totalVotes']
stats.sort_values(['p', 'totalVotes'], ascending=False).head(25)
Out[309]:
id           title                                               likeCount  dislikeCount  totalVotes         p
zN7KrZuIZbI  The office - Jim's best pranks                             17             0          17  1.000000
e6eqdhjFeMg  How To: Office Pranks by Moosejaw                          15             0          15  1.000000
d1w8-jqZCbs  Rise Guys Off Air 7/10/15 Office pranks, takin...          11             0          11  1.000000
TEpB8JZ8SWg  Moosejaw's Office Pranks                                    6             0           6  1.000000
klbHdec-V1I  Office Pranks | Super Gluing Desk Drawers | Ti...           6             0           6  1.000000
wDqYcvFKlFU  Day 4: Office Pranks                                        2             0           2  1.000000
7yVAVDkyUFs  Dwight Schrute's Desk // Jim Vs Dwight Pranks ...         289             1         290  0.996552
WDxeQ4G1ZSw  Cousin Sal's No-Prank Prank                              3833            27        3860  0.993005
xLxHtBt2jtU  Asian Jim // Jim Vs Dwight // The Office US              3566            28        3594  0.992209
fYq2_fu30tI  Cousin Sal's Surprise Ice Bucket Challenge               2798            62        2860  0.978322
-UACtIzxv_U  6 Harmless Office Pranks                                14403           383       14786  0.974097
glFrp-CmNVA  Stapler in Jelo // Jim Vs Dwight Pranks // The...         398            11         409  0.973105
W2o5JhF38Aw  WKUK - Office Pranks                                     3499           101        3600  0.971944
8kvbtbfAq8I  10 Best April Fools Pranks                              22511           663       23174  0.971390
QCgDzUtLkCA  Japanese Dinosaur Prank                                  3026           106        3132  0.966156
xLyh_y5c0-A  12 Evil Pranks Taken To The Next Level (Photos)         21027           754       21781  0.965383
gmxQSwwTRqU  [FULL] Japanese Dinosaur Prank                           2298            87        2385  0.963522
4xQb9Kl-O3E  LG Ultra HD 84" TV PRANK (METEOR EXPLODES DURI...       21424           814       22238  0.963396
uq9F9h-kuIw  Air Horn + Office Chair Prank (Extended Cut) -...          49             2          51  0.960784
PDzinoEspBI  Office prank gone wrong!                                17785           782       18567  0.957882
t5SLR1qLu-4  10 OFFICE PRANKS!! - HOW TO PRANK                       18083           841       18924  0.955559
M4ML7jFV8NA  Office Prank - Making an Office Disappear.                341            17         358  0.952514
U1PHpkdvNOs  Gareth's Stapler - The Office - BBC                      1203            71        1274  0.944270
jtrjY8Yoor0  Office Pranks and Shenanigans at ITS HQ                   300            18         318  0.943396
JgNJciPuCdI  Best Pranks Compilation                                  5664           393        6057  0.935116
In [310]:
videos = stats.copy()
videos['title'] = videos['title'].str[:64]
videos = videos.rename(columns={'likeCount': 'likes', 'dislikeCount':'dislikes', 'totalVotes': 'n'})
videos = videos.reset_index()
videos.head()
Out[310]:
   id           title                                               likes  dislikes     n         p
0  q4360lNoW7A  The Dental Office [FULL] - Pranks and Hidden C...      54        17    71  0.760563
1  TEpB8JZ8SWg  Moosejaw's Office Pranks                                6         0     6  1.000000
2  hiW1pDzWG-s  The Office Jim and Dwight pranks 1080p                 50         5    55  0.909091
3  9G9vgT8bEb0  Office Prank Revenge! - office pranks                2633       223  2856  0.921919
4  gmxQSwwTRqU  [FULL] Japanese Dinosaur Prank                       2298        87  2385  0.963522
In [115]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('xLxHtBt2jtU', 800, 600)
Out[115]:

Video Ranking Model

What we're really doing:

$p_i$ = probability that an individual viewer will like video $i$,
$L_i$ = number of likes for video $i$,
$D_i$ = number of dislikes for video $i$,
$n_i = L_i + D_i$

Is the model realistic? Sort of.

Maximum likelihood estimate: $\hat{p}_i = L_i / n_i$
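
A short derivation of this estimate, under an assumption the slides do not state explicitly: each of the $n_i$ votes is an independent like with probability $p_i$, so that $L_i \sim \mathrm{Binomial}(n_i, p_i)$. Setting the derivative of the log-likelihood to zero gives the ratio of likes to total votes:

```latex
\begin{aligned}
\ell(p_i) &= \log\binom{n_i}{L_i} + L_i \log p_i + D_i \log(1 - p_i) \\
\frac{d\ell}{dp_i} &= \frac{L_i}{p_i} - \frac{D_i}{1 - p_i} = 0
\;\Longrightarrow\; L_i (1 - p_i) = D_i \, p_i
\;\Longrightarrow\; \hat{p}_i = \frac{L_i}{L_i + D_i} = \frac{L_i}{n_i}
\end{aligned}
```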

Is it a good estimate? Yes, but not when the sample size $n_i$ is small (e.g. 1/1, 2/2, 0/1).
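
To make the small-sample problem concrete, the sketch below recomputes the estimate for four rows copied from the ranking table above and sorts by it. The helper name `mle_like_probability` is made up for this illustration and is not part of the original notebook.

```python
# (title, likes, dislikes) rows copied from the ranking table above
videos = [
    ("Day 4: Office Pranks", 2, 0),
    ("Moosejaw's Office Pranks", 6, 0),
    ("Cousin Sal's No-Prank Prank", 3833, 27),
    ("6 Harmless Office Pranks", 14403, 383),
]


def mle_like_probability(likes, dislikes):
    """Maximum likelihood estimate L / (L + D) of the like probability."""
    return likes / (likes + dislikes)


# Rank purely by the estimate, highest first (ties keep their input order).
ranked = sorted(
    videos,
    key=lambda v: mle_like_probability(v[1], v[2]),
    reverse=True,
)

for title, likes, dislikes in ranked:
    p_hat = mle_like_probability(likes, dislikes)
    print(f"{p_hat:.6f}  n={likes + dislikes:<6d} {title}")
```

A video with only two votes lands at the top, ahead of one with 3,860 votes and a 99.3% like ratio, which is exactly the behavior the estimate cannot guard against.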

How can we model the probability of a like while also accounting for sample size?

The Beta Distribution

It is defined only on the interval from 0 to 1 and has two parameters, $\alpha$ and $\beta$, which is plenty to bend the shape into what we want.
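
As a rough illustration of how the two parameters control the shape, the sketch below prints the mean and standard deviation of a few Beta distributions using the standard closed forms, mean = α/(α+β) and variance = αβ/((α+β)²(α+β+1)). The parameter values are arbitrary examples chosen for this sketch, not values from the talk.

```python
import math


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_std(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return math.sqrt(alpha * beta / (total ** 2 * (total + 1)))


# Illustrative parameter pairs (assumed values, not from the slides).
for alpha, beta in [(1, 1), (2, 2), (9, 1), (90, 10), (900, 100)]:
    print(
        f"Beta({alpha}, {beta}): "
        f"mean={beta_mean(alpha, beta):.3f}, std={beta_std(alpha, beta):.3f}"
    )
```

The last three pairs share the same mean of 0.9 but become progressively narrower as α + β grows, so the same family can express both a vague and a confident belief about a probability.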

Probability Density Function:

$$p(x) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)},$$
where $B$ is the Beta function.
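
A minimal sketch of this density in plain Python, using the identity B(α, β) = Γ(α)Γ(β)/Γ(α+β) and `math.lgamma` to stay numerically stable for large parameters. The function names are illustrative; if SciPy is available, `scipy.stats.beta.pdf` computes the same quantity.

```python
import math


def log_beta_function(alpha, beta):
    """log B(alpha, beta), computed from log-gamma values."""
    return math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)


def beta_pdf(x, alpha, beta):
    """Beta density at x; for simplicity, returns 0 outside the open interval (0, 1)."""
    if not 0.0 < x < 1.0:
        return 0.0
    log_density = (
        (alpha - 1) * math.log(x)
        + (beta - 1) * math.log(1 - x)
        - log_beta_function(alpha, beta)
    )
    return math.exp(log_density)


# Sanity check with the midpoint rule: the density should integrate to about 1.
steps = 10_000
area = sum(beta_pdf((k + 0.5) / steps, 3, 2) for k in range(steps)) / steps
print(f"area under Beta(3, 2) density: {area:.4f}")
print(f"density at x = 0.5: {beta_pdf(0.5, 3, 2):.4f}")
```

For Beta(3, 2) the normalizer is B(3, 2) = 1/12, so the density at 0.5 is 12 × 0.25 × 0.5 = 1.5, which the function reproduces.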

$p_i$

Done

source activate research3.3
ipython nbconvert /Users/eczech/repos/users_ericczech/Code/IPython3/meetup_pres.ipynb --to slides
mv meetup_pres.slides.html ~/Sites/notebooks/
